Elsevier invoice data about hybrid open access articles

Publishers rarely make spending for hybrid open access articles transparent. Elsevier is a remarkable exception, because the publisher provides open and machine-readable data about central invoicing with funding bodies and fee waivers at the article level. This blog post demonstrates how to mine these data from Elsevier full-texts with R. Analysing the resulting dataset of 70,657 hybrid open access articles published in 1,753 journals between 2015 and now reveals that around one third of publication fees were paid through central agreements. Nevertheless, the majority of funding sources for hybrid open access articles remains unknown.

Najko Jahn (https://twitter.com/najkoja), State and University Library Göttingen (https://www.sub.uni-goettingen.de/)
Nov 25, 2019

Introduction and background

In September 2018, cOAlition S, a group of international research funders, announced its widely discussed Plan S. According to its principles, funders or research organisations should cover open access publication fees, also known as article-processing charges (APC). In the case of hybrid open access, the funders intend to financially support this business model through transformative agreements, which are central invoicing agreements aiming at the transition of subscription-based journal publishing to fully open access. cOAlition S will monitor compliance with its open access funding policy.

Because data are lacking, the monitoring of spending for hybrid open access and transformative agreements is limited. Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that many authors already do not pay publication fees themselves, keeping track of these funding streams is challenging, because publishers rarely share invoice data (Björk 2017). Nor do all funders and research organisations make per-article information about financially supported open access available, despite examples like the British Charity Open Access Fund or the Open APC Initiative (Jahn and Tullney 2016).

At the SUB Göttingen, we will address this lack of transparency in a new project funded by the Deutsche Forschungsgemeinschaft (DFG) in its programme “Open Access Transition Agreements” (Holzer 2017). Building on our pilot, the interactive Shiny app Hybrid OA Journal Monitor, this project will investigate the data needs of German library consortia and how they can be addressed by metadata requirements in transformative agreements. Case studies and data products will monitor levels of compliance with policy recommendations. Here, invoice data are essential to make the various funding streams for hybrid open access articles visible.

Against this background, this blog post presents a dataset comprising publicly available invoice data for hybrid open access articles from Elsevier, a major publisher of scholarly journals. This dataset brings together metadata from Crossref and information retrieved from open access full-texts. The methods used to obtain the data address key challenges in discovering hybrid open access articles along with funding and affiliation information using open data and tools. In addition, Elsevier’s effort to share invoice recipients serves as a good practice example for other publishers offering hybrid open access options and central agreements. It is, thus, relevant for standardisation efforts like the “ESAC Workflow Recommendations for Transformative Agreements” (Geschuhn and Stone 2017).

To demonstrate the potential of publisher-provided data for monitoring the transition of journals to open access, the dataset will be used to analyse the number and the proportion of hybrid open access articles in Elsevier journals. Drawing on Elsevier’s funding information, I will also investigate whether Elsevier sent invoices to authors or to funders and research organisations that likely have a central payment agreement with Elsevier, or if the fees were waived. Moreover, text-mined author email domains will provide a rough approximation of the affiliation of the first or corresponding author, an important data point for delineating open access funding. Finally, the publisher-provided invoice data will be compared with crowd-sourced spending data from the Open APC Initiative.

To allow for a data-driven discussion about Elsevier’s approach and its potential for monitoring transformative agreements, I made the resulting dataset openly available on GitHub along with the source code used to obtain the data.

Methods

To start, I used the Elsevier publication fee price list, an openly available PDF document, to determine the current hybrid open access journals in Elsevier’s journal portfolio. The rOpenSci tabulizer package (Leeper 2018) allowed me to extract data about these journals from this file.

Then, I interfaced the Crossref REST API with the R package rcrossref (Chamberlain et al. 2019). The first API call retrieved facet field counts for license URLs and the yearly article volumes for the period 2015-19 for every journal. After matching Creative Commons license URLs indicating open access articles, a second API call retrieved article-level metadata per journal. Next, I used the metadata field delay-in-days to exclude delayed open access articles. Because Crossref calculated the delay from dates that were recorded in different formats for a few records, I allowed for a lag of 31 days.
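The filtering logic described above can be sketched as follows. This is a minimal illustration in Python, not the R code used for this post. It assumes that each Crossref work record carries a `license` list whose entries have the fields `URL` and `delay-in-days`; the three sample records and their DOIs are made up for the example.

```python
MAX_DELAY_DAYS = 31  # tolerated lag between publication and licence start


def is_immediate_cc(work, max_delay=MAX_DELAY_DAYS):
    """Return True if a Creative Commons licence applies within max_delay days."""
    for lic in work.get("license", []):
        url = lic.get("URL", "").lower()
        delay = lic.get("delay-in-days")
        if delay is None:
            continue  # no delay information: skip this licence entry
        if "creativecommons.org/licenses/" in url and delay <= max_delay:
            return True
    return False


# Hypothetical records, shaped like Crossref work metadata
works = [
    {"DOI": "10.1234/example.1",
     "license": [{"URL": "http://creativecommons.org/licenses/by/4.0/",
                  "delay-in-days": 0}]},
    {"DOI": "10.1234/example.2",
     "license": [{"URL": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
                  "delay-in-days": 365}]},
    {"DOI": "10.1234/example.3",
     "license": [{"URL": "https://www.elsevier.com/tdm/userlicense/1.0/",
                  "delay-in-days": 0}]},
]

hybrid_oa = [w["DOI"] for w in works if is_immediate_cc(w)]
print(hybrid_oa)
```

In this toy input, the second record is dropped as delayed open access and the third because its license is not a Creative Commons one.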

Elsevier participates in the Crossref Text and Data Mining Services (Crossref-TDM) and provides access to full-texts as HTML and XML documents. Surprisingly, the XML representation contains not only the full-text, but also embedded metadata, including information about open access sponsorship, in the <core> node:


<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
  BMBF - German Federal Ministry of Education and Research
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
  http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>

Snapshot of open access metadata in Elsevier XML full-texts. https://api.elsevier.com/content/article/PII:S0169409X18301479?httpAccept=text/xml

After downloading the Elsevier full-texts with the crminer package (Chamberlain 2018), I extracted the open access information shown above from the XML documents.
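As a rough illustration of this extraction step, the following Python sketch (again not the R code behind this post) reads the open access elements from an XML fragment modelled on the snapshot above. The mapping from element names to the column names `oa_sponsor_type`, `oa_sponsor_name` and `oa_archive` is my assumption for the example, and the namespace handling is a precaution because real full-texts may use XML namespaces.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the snapshot above; real full-texts are much larger
SAMPLE = """
<core>
  <openaccess>1</openaccess>
  <openaccessArticle>true</openaccessArticle>
  <openArchiveArticle>false</openArchiveArticle>
  <openaccessSponsorName>
    BMBF - German Federal Ministry of Education and Research
  </openaccessSponsorName>
  <openaccessSponsorType>FundingBody</openaccessSponsorType>
</core>
"""

# Assumed mapping: output column -> XML element name
FIELDS = {
    "oa_sponsor_type": "openaccessSponsorType",
    "oa_sponsor_name": "openaccessSponsorName",
    "oa_archive": "openArchiveArticle",
}


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}name' to 'name'."""
    return tag.rsplit("}", 1)[-1]


def extract_oa_info(xml_text):
    """Collect the open access fields of one XML document as a dict."""
    root = ET.fromstring(xml_text)
    found = {local_name(el.tag): (el.text or "").strip() for el in root.iter()}
    return {column: found.get(tag) for column, tag in FIELDS.items()}


info = extract_oa_info(SAMPLE)
print(info)
```

Elements that are missing from a document come back as `None`, which keeps articles without sponsor information in the dataset instead of failing on them.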

Moreover, I parsed the first author email address, assuming that email domains roughly indicate the affiliation of the first or corresponding author at the time of publication. The package urltools (Keyes et al. 2019) enabled me to extract email domains and to split them into meaningful parts.
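To make the splitting of email hosts concrete, here is a small Python sketch of the idea. It is not how urltools works internally: the set `KNOWN_SUFFIXES` is a tiny hand-picked stand-in for the public suffix list, and the example addresses are invented.

```python
# Tiny stand-in for the public suffix list, which has thousands of entries
KNOWN_SUFFIXES = {"ac.uk", "edu", "com", "de", "nl", "se"}


def split_email_host(email, suffixes=KNOWN_SUFFIXES):
    """Split the host of an email address into subdomain, domain and suffix."""
    host = email.rsplit("@", 1)[-1].strip().lower()
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "ac.uk" is matched before a bare "uk" could be
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return {
                "host": host,
                "tld": labels[-1],
                "suffix": candidate,
                "domain": labels[i - 1] + "." + candidate,
                "subdomain": ".".join(labels[: i - 1]) or None,
            }
    # Unknown suffix: keep the host, leave the parts empty
    return {"host": host, "tld": labels[-1], "suffix": None,
            "domain": None, "subdomain": None}


# Invented addresses for illustration
for address in ["jane.doe@med.cornell.edu", "j.smith@chem.ox.ac.uk", "someone@gmail.com"]:
    print(split_email_host(address))
```

The last example shows the limitation discussed later in this post: an address at a commercial email provider yields a valid domain, but says nothing about the author's institution.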

Finally, to measure the overlap between crowd-sourced and publisher-provided invoice data, I downloaded spending data from the Open APC Initiative (Aasheim et al. 2019). To my knowledge, the Open APC Initiative maintains the largest evidence base for institutional spending on open access publication fees.

Dataset characteristics

In the following data analysis, I will be using two files. The first file, journal_facets.json, contains the number of publications per year for each Elsevier journal that offers hybrid open access options. It also provides the various license URLs found through Crossref.

The second file, elsevier_hybrid_oa_df.csv, comprises article-level data. Each row holds information for a single hybrid open access article, and the columns represent:

Variable          Description
doi               DOI
license           Open Content License
issued            Earliest publication date
issued_year       Earliest publication year
issn              ISSN, a journal identifier
journal_title     The title of the journal
journal_volume    Yearly publication volume
tdm_link          Link to the XML full-text
oa_sponsor_type   Invoice recipient type
oa_sponsor_name   Institution that directly received an invoice
oa_archive        Was open access provided through Elsevier’s open archive programme, in which articles are made openly available after an embargo?
host              Email host, e.g. med.cornell.edu
tld               Top-level domain, e.g. edu
suffix            Extracted suffix from domain name as defined by the public suffix list, e.g. ac.uk
domain            Email domain, e.g. cornell.edu
subdomain         Email subdomain, e.g. med

It must be noted, however, that Elsevier had not provided official documentation of its open access and invoice data at the time of writing this blog post.

Results

In total, 1,753 out of 1,990 Elsevier journals with an open access option published at least one open access article between 2015 and now, corresponding to about 88 %. In these journals, 70,657 hybrid open access articles appeared. The total share of hybrid open access in the publication volume of Elsevier journals was 2.4 %.

What is the uptake of hybrid open access among Elsevier journals?

The hybrid open access share varied across Elsevier journals. Figure 1, which replicates a boxplot aesthetic from The Economist magazine using the ggeconodist package (Rudis 2019), shows a slow but steady hybrid open access uptake. The median open access proportion was around 3 % in the first eleven months of 2019.

Figure 1: Hybrid open access uptake in Elsevier journals per year in percent, visualised as a diminutive distribution chart. Since 2015, most journals have had a slow uptake rate of hybrid open access. In general, the hybrid open access publishing model played a marginal role compared to Elsevier’s total publication volume. Data Source: Elsevier B.V. / Crossref.

How many payments for hybrid open access articles were facilitated by central invoicing?

In most cases, Elsevier sent invoices for hybrid open access publication fees to individual authors (59 %). For around 33 % of articles, the publisher directly charged funders and research organisations. Elsevier granted publication fee waivers for 6.2 % of hybrid open access articles.
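Shares like these can be derived by tallying the invoice recipient type per article. The Python sketch below shows the calculation on toy data and is not the code used for this post. Only the label `FundingBody` appears in the XML snapshot shown earlier; the labels `Author` and `FeeWaiver` are placeholders I chose for the example, and the counts are invented.

```python
from collections import Counter

# Toy values standing in for the oa_sponsor_type column.
# "Author" and "FeeWaiver" are placeholder labels, not confirmed values.
sponsor_types = ["Author"] * 6 + ["FundingBody"] * 3 + ["FeeWaiver"]


def shares(values):
    """Return the percentage share of each distinct value, rounded to one decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}


invoice_shares = shares(sponsor_types)
print(invoice_shares)
```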

Figure 2 shows the annual development per invoicing type. Inspired by Claus O. Wilke’s “Fundamentals of Data Visualization” (Wilke 2019), each type is visualised separately as a part of the total. The figure reveals a general growth of hybrid open access articles. It illustrates that this development was mainly driven by individual options to pay for hybrid open access publication fees, while central invoicing stagnated. The number of fee-waived articles also remained more or less constant between 2015 and now.

Figure 2: Development of fee-based hybrid open access publishing in Elsevier journals by invoicing type. Colored bars represent the invoice recipient, or if the fee was waived. Grey bars show the total number of hybrid open access articles published by Elsevier journals between 2015 and now. Data Source: Elsevier B.V / Crossref.

The following interactive visualisation (Figure 3), created with the echarts4r package (Coene 2019), lets you browse the invoicing data.

Figure 3: Breakdown of Elsevier hybrid open access journal articles by invoice recipient. Each rectangle represents an invoicing type and can be broken down by recipient. Data Source: Elsevier B.V.

Clicking on “Agreement” shows the funders and research organisations that covered hybrid open access publication fees as part of a central agreement. In total, Elsevier disclosed 74 different institutions that received an invoice for open access publication. Not surprisingly, mostly British and Dutch funders or consortia paid for hybrid open access in Elsevier journals. The German Federal Ministry of Education and Research (BMBF) is also well represented, despite the current boycott by most universities and research organisations in Germany (Else 2018). In fact, the BMBF is not part of the Alliance of Science Organisations in Germany, whose members decided to cancel Elsevier subscriptions in 2017. Since 2018, the BMBF has financially supported 181 hybrid open access articles that appeared in 129 Elsevier journals, according to the publisher.

Who published hybrid open access in Elsevier journals?

In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first or corresponding author, a data point used to delineate open access funding (Geschuhn and Stone 2017).

Figure 4: Email domain analysis of first or corresponding authors publishing hybrid open access in Elsevier journals. Around every fourth article published between 2015 and now was from an author affiliated with a UK-based academic institution. Data Source: Elsevier B.V.

Figure 4 presents a breakdown by email domain suffix. In total, 67,900 email addresses were retrieved and parsed from Elsevier full-texts, corresponding to a share of 96 %. Most corresponding author emails originate from academic institutions in the UK (“ac.uk”), reflecting the country’s leading role in supporting hybrid open access publications (Pinfield, Salter, and Bath 2015). They are followed by domains from commercial organisations (“com”) and US-American institutions of higher education (“edu”). The figure illustrates that European institutions from Germany (“de”), the Netherlands (“nl”), and Sweden (“se”) were well represented. In total, 330 domain suffixes were retrieved.

In the following, a hierarchical, interactive treemap visualises the distribution of the email domains (see Figure 5). It appears that this distribution roughly represents the national research landscapes. However, the dominance of domains from commercial organisations, mostly email providers like “gmail.com” or the Chinese “163.com” and “126.com”, highlights the limitations of this approach to infer eligible funding institutions from author email addresses.

Figure 5: Email domain analysis of first or corresponding authors publishing hybrid open access in Elsevier journals. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. Data Source: Elsevier B.V.

How does Elsevier invoice data compare to spending information from the Open APC Initiative?

Finally, I was interested in the overlap between publisher-provided invoice data from Elsevier and institutional spending data from the Open APC Initiative. In total, the Open APC Initiative tracked 8,213 out of 70,657 hybrid open access articles, corresponding to a share of 12 %. Institutional expenditures for these hybrid open access articles amounted to 24,008,889 € according to Open APC data. However, the Open APC Initiative listed another 683 Elsevier hybrid open access articles that were not part of my dataset. One likely explanation is that the Crossref metadata representing these articles did not meet my criteria; another is that they appeared in journals that recently transitioned from hybrid to fully open access (e.g. the journal “NeuroImage”). At the journal level, the overlap was 58 %.
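An overlap of this kind can be computed by matching DOIs between the two datasets. The following Python sketch illustrates the idea and is not the code used for this post. It assumes that both sources identify articles by DOI and that DOIs may differ in letter case or carry a resolver prefix; all DOIs in the example are made up.

```python
def normalise_doi(doi):
    """Lower-case a DOI and drop a leading resolver prefix; DOIs are case-insensitive."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def overlap(publisher_dois, openapc_dois):
    """Compare two DOI collections and report how much they share."""
    pub = {normalise_doi(d) for d in publisher_dois}
    apc = {normalise_doi(d) for d in openapc_dois}
    both = pub & apc
    return {
        "in_both": len(both),
        "only_openapc": len(apc - pub),
        "share_percent": round(100 * len(both) / len(pub), 1) if pub else 0.0,
    }


# Made-up DOIs for illustration
publisher = ["10.1234/A.1", "10.1234/a.2", "10.1234/a.3", "10.1234/a.4"]
openapc = ["https://doi.org/10.1234/a.1", "10.1234/a.9"]

result = overlap(publisher, openapc)
print(result)
```

Applied to the figures reported above, 8,213 of 70,657 articles gives about 11.6 %, which rounds to the stated 12 %.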

Figure 6 presents the annual development of spending disclosure for Elsevier hybrid open access articles to the Open APC Initiative, grouped by invoicing type. The Open APC Initiative mostly tracked articles from central invoicing agreements. The figure also suggests that invoices sent to authors were covered by institutions participating in Open APC. Generally, the results confirm a delay between invoicing and reporting to the Open APC Initiative (Jahn and Tullney 2016). Surprisingly, Open APC listed institutional payments for 13 articles for which Elsevier reported that the fee was waived.

Figure 6: Development of fee-based hybrid open access publishing in Elsevier journals by invoicing type and disclosure of institutional payment by the Open APC Initiative. Colored bars represent the number of articles that are also tracked in Open APC. Grey bars show the total number of hybrid open access articles published by invoicing type between 2015 and now. Data Sources: Crossref, Elsevier B.V., Open APC Initiative.

Figure 7 presents the gap between publisher-provided invoice data and Open APC for the ten most important funding bodies. It highlights that British funders had the largest overlap rates, which reflects Open APC efforts to re-use openly available spending data from these institutions (Pieper and Broschinski 2018). On the other hand, Open APC did not track Dutch (“VSNU”), US-American (“Melinda & Bill Gates Foundation”) and European funding activities (“European Research Council”) for hybrid open access publication fees.

Figure 7: Proportion of fee-based hybrid open access articles from Elsevier disclosed by the Open APC Initiative. Blue areas represent joint spending data; grey areas represent centrally paid articles that were not present in the Open APC data. Data Source: Crossref, Elsevier B.V., Open APC Initiative.

Discussion and conclusion

In this blog post, I have shown how to obtain invoice data from Elsevier. Embedded in the full-text, they give information on whether Elsevier sent invoices to authors or to funders and research organisations that likely have a central payment agreement with Elsevier, or if the fee was waived. Providing such machine-readable data makes funding streams for hybrid open access more transparent.

At the same time, the data analysis highlights various critical aspects of hybrid open access. Despite increased funding activities, only a small proportion of journal articles were made openly available under this model. Furthermore, Elsevier sent the majority of invoices directly to the authors. This practice not only imposes administrative burdens and costs on all parties involved, it also obscures the funding sources for publication fees. Existing spending data from funders and research organisations can only partly close this gap.

Plan S is set to change current practices of funding hybrid open access. Because Elsevier’s transparency is a remarkable exception, workflow guidelines for transformative agreements should consider generalising the publisher’s example of sharing invoice data. Although future work needs to tackle remaining questions about data quality and coverage, publisher-provided invoice data make publishers more accountable and extend the evidence base about hybrid open access. As a result, the data analysis presented here provides a basis to improve the monitoring of funding streams in the context of transformative agreements.

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft, project “Hybrid OA Dashboards: Mehrwertorientierte Analytics-Anwendungen zur Förderung der Kostentransparenz bei Transformationsverträgen”, project id 416115939.

Aasheim, Jens Harald, Benjamin Ahlborn, Chelsea Ambler, Magdalena Andrae, Jochen Apel, Hans-Georg Becker, Roland Bertelmann, et al. 2019. The Open Apc Initiative. Bielefeld University Library. https://github.com/OpenAPC/openapc-de.

Björk, Bo-Christer. 2017. “Growth of Hybrid Open Access, 2009-2016.” PeerJ 5: e3878. https://doi.org/10.7717/peerj.3878.

Chamberlain, Scott. 2018. Crminer: Fetch ’Scholary’ Full Text from ’Crossref’. https://CRAN.R-project.org/package=crminer.

Chamberlain, Scott, Hao Zhu, Najko Jahn, Carl Boettiger, and Karthik Ram. 2019. Rcrossref: Client for Various ’Crossref’ ’Apis’. https://CRAN.R-project.org/package=rcrossref.

Coene, John. 2019. Echarts4r: Create Interactive Graphs with ’Echarts Javascript’ Version 4. http://echarts4r.john-coene.com/.

Dallmeier-Tiessen, Suenje, Robert Darby, Bettina Goerner, Jenni Hyppoelae, Peter Igo-Kemenes, Deborah Kahn, Simon C. Lambert, et al. 2011. “Highlights from the Soap Project Survey. What Scientists Think About Open Access Publishing.” http://arxiv.org/abs/1101.5260.

Else, Holly. 2018. “Dutch Publishing Giant Cuts Off Researchers in Germany and Sweden.” Nature 559 (7715): 454–55. https://doi.org/10.1038/d41586-018-05754-1.

Geschuhn, Kai, and Graham Stone. 2017. “It’s the Workflows, Stupid! What Is Required to Make ‘Offsetting’ Work for the Open Access Transition.” Insights: The UKSG Journal 30 (3): 103–14. https://doi.org/10.1629/uksg.391.

Holzer, Angela. 2017. “Wozu Open-Access-Transformationsverträge?” O-Bib. Das Offene Bibliotheksjournal 4 (2): 87–95. https://doi.org/10.5282/o-bib/2017H2S87-95.

Jahn, Najko, and Marco Tullney. 2016. “A Study of Institutional Spending on Open Access Publication Fees in Germany.” PeerJ 4 (August): e2323. https://doi.org/10.7717/peerj.2323.

Keyes, Os, Jay Jacobs, Drew Schmidt, Mark Greenaway, Bob Rudis, Alex Pinto, Maryam Khezrzadeh, et al. 2019. Urltools: Vectorised Tools for Url Handling and Parsing. https://CRAN.R-project.org/package=urltools.

Leeper, Thomas J. 2018. Tabulizer: Bindings for Tabula Pdf Table Extractor Library. https://cran.r-project.org/package=tabulizer.

Pieper, Dirk, and Christoph Broschinski. 2018. “OpenAPC: A Contribution to a Transparent and Reproducible Monitoring of Fee-Based Open Access Publishing Across Institutions and Nations.” Insights: The UKSG Journal 31. https://doi.org/10.1629/uksg.439.

Pinfield, Stephen, Jennifer Salter, and Peter A. Bath. 2015. “The "Total Cost of Publication" in a Hybrid Open-Access Environment: Institutional Approaches to Funding Journal Article-Processing Charges in Combination with Subscriptions.” Journal of the Association for Information Science and Technology 67 (7): 1751–66. https://doi.org/10.1002/asi.23446.

Rudis, Bob. 2019. Ggeconodist: Create Diminutive Distribution Charts. https://gitlab.com/hrbrmstr/ggeconodist.

Solomon, David J., and Bo-Christer Björk. 2011. “Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal.” Journal of the Association for Information Science and Technology 63 (1): 98–107. https://doi.org/10.1002/asi.21660.

Wilke, Claus O. 2019. Fundamentals of Data Visualization. O’Reilly. https://serialmentor.com/dataviz/.